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Bayesian Variable Selection with Structure Learning: 
Applications in Integrative Genomics 
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and Veerabhadran Baladandayuthapani^. 

Abstract. Significant advances in biotechnology have allowed for simnltaneous measnre- 
ment of molecular data points across multiple genomic and transcriptomic levels from a 
single tumor/cancer sample. This has motivated systematic approaches to integrate multi¬ 
dimensional structured datasets since cancer development and progression is driven by nu¬ 
merous co-ordinated molecular alterations and the interactions between them. We propose 
a novel two-step Bayesian approach that combines a variable selection framework with in¬ 
tegrative structure learning between multiple sources of data. The structure learning in the 
first step is accomplished through novel joint graphical models for heterogeneous (mixed 
scale) data allowing for flexible incorporation of prior knowledge. This structure learning 
subsequently informs the variable selection in the second step to identify groups of molecular 
features within and across platforms associated with outcomes of cancer progression. The 
variable selection strategy adjusts for collinearity and multiplicity, and also has theoretical 
justifications. We evaluate our methods through simulations and apply them to a motivating 
genomic (DNA copy number and methylation) and transcriptomic (mRNA expression) data 
for assessing important markers associated with Glioblastoma progression. 

Key words: Bayes variable selection; collinearity; gene networks; graph structured covariates; mixed graphical 
models; prior knowledge incorporation. 
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1. INTRODUCTION 


The last decade has seen a proliferation of genomic data, aided partly by the rapid 
evolution and declining costs of modern technologies, producing high-throughput multi¬ 
dimensional data. It is now technologically and economically feasible to collect diverse data 
on matched patient/tumor samples at a detailed molecular resolution across multiple modal¬ 
ities such as genomics (DNA copy number and methylation), transcriptomics (mRNA/gene 
expression) and proteomics. Such large scale co-ordinated efforts include worldwide con¬ 
sortiums such as the International Cancer Genome Consortium (ICGC; icgc.org) and The 
Cancer Genome Atlas (TCGA; cancergenome.nih.gov), which have collated data over mul¬ 
tiple types of cancer on diverse molecular platforms, to accelerate discovery of molecular 
markers associated with cancer development and progression. 

Initial studies in cancer genomics relying on single platform analyses (mostly gene expression- 
and protein-based) have discovered multiple candidate “druggable” targets such as KRAS 
mutation in colon and lung cancer (Capon, et al., 1983; Shimizu, et al., 1983), BRAF in 
colorectal, thyroid, and melanoma cancers (Davies, et ah, 2002), and PI3K in breast, colon 
and ovarian cancers (Samuels, et ah, 2004; Bachman et al., 2004; Campbell et ah, 2004). 
However, it is believed that integrating data across multiple molecular platforms has the 
potential to discover more co-ordinated changes on a global (unbiased) level (Chin et. al, 
2011). Through data integration, we espouse the philosophy that cancer is driven by nu¬ 
merous molecular/genetic alterations and the interactions between them, with each type of 
alteration likely to provide a unique but complementary view of cancer progression. This of¬ 
fers a more holistic view of the genomic landscape of cancer, with increased power and lower 
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false discovery rates in detecting important biomarkers (Tyekucheva et al., 2011; Wang et. 
al, 2013), and translating to substantially improved understanding, clinical management and 
treatment (Hamid et. al, 2009). Integrating data across diverse platforms has sound bio¬ 
logical justifications due to interplay between diverse genomic features. Between platforms, 
attributes at the DNA level such as methylation and copy number variation can affect mRNA 
expression, which in turn is known to influence clinical outcomes such as cancer progression 
times and stage/pathology of the tumors. Within platform interactions arise from pathway- 
based dependencies as well as dependencies based on chromosomal/genomic location (an 
example being copy number data). 

Statistically, the concept of data integration can be very broad depending on the scientific 
question of interest such as prediction, classihcation, variable selection, or clustering (or 
combination of these) - and is an area of active investigation. Lanckriet et al. (2004) 
propose a two stage approach, hrst computing a kernel representation for data in each 
platform and subsequently combining kernels across platforms in a classihcation model. Shen 
et. al (2013) proposed a classihcation model “iCluster”, which uses a joint latent variable 
model to integrate data across multiple platforms. Tyekucheva et al. (2011) propose a 
logistic regression model regressing a clinical outcome on covariates across multiple platforms. 
Recently Wang et al. (2013), Jennings et al (2013) proposed integrative Bayesian analysis 
of genomics data (iBAG, in short), which models biological relationships between genomic 
features from multiple platforms, and subsequently uses the estimated relationships to relate 
the platforms to a clinical outcome. Lock et al. (2013) propose an additive decomposition 
of variation approach consisting of low-rank approximations capturing joint variation across 
and within platforms, while using orthogonality constraints to ensure that patterns within 
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and across platforms are unrelated. 

Except for IBAG, the above approaches are not equipped to exploit the information 
garnered from data integration in relating the platforms to the clinical outcome - which is 
usually the goal of translational research in Ending markers of cancer progression. Unfortu¬ 
nately, iBAG assumes independence between genes in discovering mechanistic relationships 
between platforms at a gene-centric level. It is well-established that genes that promote or 
inhibit tumor development function in co-ordination within modules, such as functional or 
cell signaling pathways (Boehm and Hahn, 2011). In this article, we propose an approach 
that allows for a broad network of interactions within and between platforms (and genes) 
through a graphical model approach, and subsequently ensures that associated sub-groups 
of features under the estimated graph influence the outcome in a coordinated manner via a 
structured variable selection step. Figure 1 provides a conceptual schematic of our approach. 

Our methods are motivated by a TGGA based study focusing on discovery of impor¬ 
tant molecular markers for progression times in glioblastoma patients by integrating diverse 
platform-specific features on different scales: discrete (copy number variation) and continu¬ 
ous (DNA methylation and mRNA expression), measured on the same set of genes. In ad¬ 
dition, there exists substantial prior knowledge on pathway (graphical) interactions between 
these genes (e.g. from databases such as KEGG, Gene Ontology or from other prior studies), 
the incorporation of which can result in improved estimation and refined biological interpreta¬ 
tions. Thus integrating information across these platforms requires us to develop a graphical 
modeling approach which estimates the graph for mixed data (from different scales), while 
being able to incorporate prior graphical knowledge. Different platforms might be poten- 
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Figure 1: A schematic diagram of our integrative modeling approach. Panel (a) shows the heatmaps 
of the genes by sample matrix constructed from data for three platforms; panel (b) depicts the prior 
graph constructed using previous studies; while panel (c) is the estimated graph of the genes within 
and across the platforms. The dashed arrows determine graphical structure and the solid arrows 
represent the regression model incorporating graphical dependencies. Red and green lines in panel 
(c) represent high negative and positive partial correlations under the estimated graph, while all 
other edges with lower absolute partial correlations are depicted with watermark lines. 
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tially informative about different sets of within platform gene-gene interactions which are 
not evident based on single platform analysis, while clusters of features within and across 
platforms are expected to work together to drive the cancer mechanism, thus rendering the 
usual variable selection approaches assuming independence between covariates inadequate. 

There is an increasing variable selection literature for structured covariates lying on a 
known graph, which mirrors the growing recognition that incorporation of supplementary 
covariate information in the analysis of genomic data can be instrumental in improved infer¬ 
ences (Pan et ah, 2010). Such developments have been aided by a proliferation of genomic 
databases storing pathway and gene-gene interaction information (Stingo et. al, 2011). Some 
examples of Bayesian regression and variable selection approaches for graph structured co¬ 
variates include Li and Zhang (2010), Stingo et al. (2011), and Rockova and Lesaffre (2013). 
However, the focus of the above approaches has been on single platform regression with 
covariates lying on a known graph specihed a priori, which is very different from the data 
integration component in the present work, whereby we seek to achieve the dual goal of learn¬ 
ing the graph for mixed covariates while incorporating prior knowledge, and subsequently 
using such structure learning to relate the covariates to the outcome. We note that naively 
using a hxed known graph may not be practical in genomic studies, where the network of 
gene-gene interactions is likely to vary over different biological or experimental conditions. 
In such cases, mis-specihcation of prior information can result in inferior quality inferences. 

We propose a principled two stage approach which estimates the graph in the first step, 
and uses this point estimate to inform the subsequent variable selection step. Concisely 
stated, the major novelties presented in our article are: (i) estimating graphical models for 
mixed data, while incorporating prior graphical knowledge and controlling the degree of con- 
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fidence one wishes to place on the prior graph through a belief parameter. The graphical 
modeling approach for mixed data models ordered categorical variables by rounding contin¬ 
uous latent variables (Albert and Chib, 1993; Kottas, 2005; Jara et. al, 2007; Canale and 
Dunson, 2012), and specihes a graphical model jointly on the observed continuous covariates 
and the latent continuous variables; (ii) using the estimated graph to dehne (potentially 
overlapping) groups of molecular features within and across platforms, which work together 
in driving the outcome; and (iii) developing a structured variable selection approach, which 
is designed to control collinearity by simultaneously including all variables belonging to each 
aforementioned group, while having sufficient control over false positives by specifying appro¬ 
priate multiplicity adjusted priors. The proposed approach generalizes the afore-mentioned 
approaches for graph-structured covariates assuming a fixed known graph when the belief 
parameter is large (~ cx)), and corresponds to an unsupervised approach with complete lack 
of prior graphical knowledge when the belief is small (~ 0). In addition, to our knowledge, 
ours is the hrst Bayesian graphical model approach to address mixed data. 

Section 2 describes the methodology, section 3 outlines the posterior computation scheme, 
section 4 provides theoretical justihcations for our approach, section 5 lays out simulation 
studies, section 6 applies our approach to genomic data for individuals with glioblastma 
multiforme. A conclusion section summarizing the main results is provided and the Appendix 
contains proof of the Theorems and posterior computation steps. 

2. METHODOLOGY 

Notations: In this article, we focus on a univariate continuous response, y G 3ft, to be re¬ 
gressed on a p-dimensional vector of mixed covariates x = [xi,..., x^*] having an underlying 
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graphical structure which is to be estimated, while incorporating prior graphical knowledge 
Gq. Here Xj is the covariate vector corresponding to the j-th platform which has pj features, 
j = 1,, D*, so that — P- The discrete covariates (e.g. copy number) in the mixed 

vector X are allowed to be binary or ordered categorical. 

Let us denote the n x 1 vector of responses as and the n x p dimensional covari¬ 
ate matrix as X = [Xi,..., We construct a joint probability model employing a 

two stage strategy, and based on the factorization P{Y^,X) = P{X)P{Y"'\X). In the 
hrst structnre learning stage, we use the posterior P(H|X,Go) oc P(X|H)P(H|Go) to esti¬ 
mate the p X p precision (inverse covariance) matrix hi for the covariates, nnder a con- 
tinnous shrinkage prior P(r2|Go) incorporating prior graph knowledge Gq. The precision 
matrix is used to obtain estimates of the graph G using a post-MCMC step as described 
in the seqnel. In the second step, called strnctured variable selection, G is used to in¬ 
form the regression model throngh the prior on the model space, i.e. we specify the likeli¬ 
hood P{Y^\X, My, 0^l^)P(M-y|G, 0^l'’^)P(©^l^|G), where denotes a model in the model 
space r and is the list of regression parameters. The goal of the second step is to 

propose a variable selection model designed to control collinearity by incorporating the in¬ 
formation represented by the graph structnre on covariates, while having snfficient control 
over multiplicity through multiplicity adjusted priors. We discuss these steps in detail below. 

2.1 Integrative Structure Learning using Mixed Graphical Models 

Recalling that we are dealing with mixed covariates x = [xi,... ,xd*] across diverse plat¬ 
forms, we propose a graphical model approach for mixed data designed to integrate and hnd 
inherent strnctnres within and across mnltiple platforms. The graphical modeling approach 



involving the vertex set V = {1,... ,p} and edge set E, serves the following two purposes: 
(1) model dependence between features within and across platforms and genes - in our ap¬ 
plication, measurements for different platforms are available for the same set of genes, so 
that the joint modeling across platforms allows for both cis-acting (localized to a gene) and 
trans-acting (across gene locations) ; and subsequently (2) use the graph to detect subgroups 
of features within and across platforms, as well as gene locations, which dehne functional 
modules that are related and potentially work together to drive cancer progression. Such 
modules correspond to cliques, which are dehned as a subgroup of V such that each node in 
this subgroup is connected to every other node in the subgroup. 

The mixed data formulation: Without loss of generality, let Xj = [xii, ..., Xio*] = 
(xp,xP) denote the covariate vector for the i-th subject, with the superscripts C,0 de¬ 
noting continuous and ordinal (and/or binary) covariates respectively. Let denote the 
generic notation for the latent continuous variable corresponding to ordinal predictor x^, 
and consider the following graphical model for mixed covariates 

xfj = I, Di_i <= zfj < Di, -oo = Do < Di < ... < Dm, = oo, j = l,...,po, 

z? = (xp,zP) ~ iV[^](0,ff^), ff~7r(ff), i = (1) 

where denotes a Gaussian distribution with truncated domains for z'^, Mo is the num¬ 
ber of ordinal levels, po is the number of ordinal covariates, and G ~ 7r(r2) corresponds to a 
continuous shrinkage prior on the p x p precision matrix to be described in the sequel. To 
keep things simple, the prior on the precision does not incorporate prior graphical knowledge 
Go at this stage, which is to be introduced in the next section. Instead of specifying shrinkage 
priors, one can alternatively choose a discrete mixture specihcation (Dawid and Lauritzen, 
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1993;Tallurietal.,2O13)in0as: (xf, zf) ~ 7Vp](0 ,Og ~ 7r(0|G), G ~ 7r(G), i = 
Such discrete mixture approaches would place exact zeros for off-diagonals with 
absent edges corresponding to G, while maintaining positive dehniteness of flc and simulta¬ 
neously specifying a prior G ~ 7r(G). However, we adopt a continuous shrinkage approach, 
as it provides us an avenue for incorporating prior knowledge in a seamless manner as well 
as achieving sparsity. Under the generic continuous shrinkage specihcation 0, the posterior 
is given by 

n ( po Mo 'I 

P(U, zf,..., Z^IX) oc 7r(U) n ^ n E ^ 0; 

i=l 1=1 J 

from which MCMC samples arise. Subsequently a post-MCMC step can be implemented 
in order to obtain the graph estimate G by thresholding absolute partial correlations corre¬ 
sponding to the estimated precision matrix H, as elaborated in Section 3. 

Unfortunately, in spite of a rich literature on Gaussian graphical models, there seems to be 
limited development of graphical modeling approaches incorporating prior knowledge. The 
most common approach seems to be incorporating the prior graph via informative priors 
on edge inclusion probabilities (Baladandayuthapani et. al, 2014), with Mukherjee and 
Speed (2008) providing an alternative approach. In general, incorporating prior knowledge 
for all edges through a discrete mixture approach can run in to potential difficulties, since 
hnite runs of the MCMC are not able to update a sizable proportion of the edges even for 
moderate dimensional graphs. In such a case, these edges will not be updated at all, and 
will instead correspond to the initial choice of the adjacency matrix (Kundu et ah, 2014). 
This violates our objective of incorporating prior knowledge while learning all possible edges 
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of the graph from the data, with such discrete mixture approaches potentially resulting in 
graphical estimates which are increasingly sensitive to the initial choice of the adjacency 
matrix for higher dimensions and hnite lengths of MCMC chains. 

The above considerations motivate us to use a continuous shrinkage approach for graph¬ 
ical models, which is equipped to incorporate prior graphical knowledge by using a novel 
procedure involving a belief parameter. Unlike discrete mixture approaches, the continuous 
shrinkage based approach does not depend on an initial adjacency matrix, and can update all 
elements of the precision matrix at every iteration, thus utilizing the available prior knowl¬ 
edge on all edges to drive inferences on the graph. 


2.2 Incorporating Prior Graph Information 

We now describe the graphical prior incorporating prior knowledge. Let Go be the prior 
graph having vertex set V = {1,... ,p} and edge set Eq, with the corresponding adjacency 
matrix Aq = (ao,p), where ao,jj = 1 if edge (f, j) is present and 0 otherwise. Throughout this 
article, we consider undirected graphs so that ao,p- = aoji for all (f, j). In order to incorporate 
prior knowledge Gq, we develop the following novel generalization of the Bayesian graphical 
lasso (Wang, 2012) 


7r(U|T) 

7r(T|A) 

Pij 


n 71= n 7 exp(-^o;,,)loeM+, 

i<j i=i ^ ^ 


oc 


72 

A?. 


i<j 




oc Xij\Go{l-pij)Ga{Kij+ ax,bx)+PijGa{ax,bx), 


Kij Qp, (1 ClQpj')Kij bp), 


( 2 ) 


where is the set of positive dehnite matrices. A, r, are vectorized parameters with 
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dimension p{p + l)/ 2 , and Kij is the belief parameter for edge (i,j) = Kji,i ^ j, under 

an undirected graph). In Q, 7 r(ff|T) and 7 r(T|A) specihy a double exponential prior Uij ~ 
DE(Ajj),i 7 ^ j and an exponential prior on the diagonals Uu ~ Exp(Aii). The novelty of 
our approach involves incorporating prior knowledge Gq through the shrinkage parameter 
A via the belief parameters k. Specihcation of different beliefs for different edges can be 
useful when subject knowledge or historical information points strongly towards presence or 
absence of certain edges (gene-gene interactions in our application), but is ambiguous about 
others, as often happens in genomic studies. 

Role of belief parameter: Through the use of a belief parameter, we can control the degree 
of conhdence we place on the available prior graph information. This is a useful feature in 
enabling investigators to be flexible i.e. either skeptical or fairly conhdent about the prior 
knowledge, thus providing them with a range of alternatives in situations where there is 
reason to suspect mis-specihcation. To understand the role of the belief parameter in prior 
specihcation, observe that 


T ®p)/(^p' T 0,p T ^p)t — (1 Pij^^Kij Clx)/b\ -|- PijCl\/b\, (3) 

where i 7 ^ j = 1,... ,p. When ao,p' = 1 (i.e. Gq suggests presence of an edge), E{pij) ^ 1 
for large values of Hij with nij » bp. In extreme case when Kij —)■ 00 , we have pij ~ 1 
which implies E{Xij) ^ ax/h\ ~ 0 for appropriate values of ax,bx- Again for a large Kij and 
for ao,ij = 0 (i.e. Go suggests absence of an edge), we have E{pij) ^ 0. In the extreme case 
when Kij —)■ cx), we have pij ~ 0 which implies E{Xij) ^ {nij + ax)/bx ^ 00 . In summary, 
a large value of Kij encourages the presence or absence of an edge through a small or large 
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value of Xij, according to ao,jj =1 or 0 in the prior graph. In the limiting case as k —>■ oo, 
the realizations under formulation ([^ will become increasingly degenerate at the prior graph 
Gq. On the other hand when we are skeptical of the prior graph knowledge, one can simply 
choose K 0 which reduces formulation (|^ to Xij ~ Ga{ax,bx), thus collapsing to the set-up 
with no prior knowledge. These two scenarios correspond to two extremes of regression for 
graph structured covariates in supervised and unsupervised settings, assuming a fixed known 
graph and a complete lack of prior knowledge respectively. 

Calibration of belief parameter: In practice we can use E{pij) in (j^ to quantify the degree 
of conhdence on the presence of an edge specihed by prior knowledge. On rearranging the 
terms in (|^ one has 

Kij = bp/{l - E{pij)) - {ap + bp), when oo,p- = 1, Kp = ap/E{pij) - {ap + bp), when ao,p = 0. 

Thus, if Go suggests presence of the edge {i,j), and we are 50% sure of the correctness of 
such prior knowledge, we can substitute E{pij) = 0.5 in the above expression to obtain Kij = 
bp — Qp. In practice, one can choose a suitable combination of n^j, ap, bp to reflect appropriate 
conhdence. Figure 2 presents a plot of the belief parameter versus the degree of conhdence for 
the scenarios when the prior graph suggests presence (ao,p = 1) and absence (ao,p- = 0) of the 
(i,j)-th edge. From the plot, one can categorize diherent regions of conhdence: for ao,p = 1, 
low conhdence corresponds to E{pij) < 0.5(—1 < Kij < 0), moderate conhdence corresponds to 
0.5 < E{pij) < 0.7(0 < Kij < 1.34), and high conhdence corresponds to E{pij) > 0.7(Kjj > 1.34). 
Similar regions can also be constructed for ao,p = 0. 

2.3. Regression and Structured Variable Selection 
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(a) 



Confidence of presence of edge when prior graph has edge 


(b) 



Confidence of presence of edge when prior graph does not have edge 


Figure 2: Calibration of belief parameter. The plot shows the range of the the belief parameter 
(y-axis) versus degree of confidence (x-axis). The three bins demarcated by vertical blue lines 
depict regions of low (—1 < k < 0), moderate (0 < k < 1.34) and high confidence {k > 1.34) from 
left to right. 


Through the above steps we obtain an estimate of the graphical structure on the covariates 
represented as G. In the second step, called structured variable selection, we incorporate 
the structural knowledge represented by the estimated graph in regressing the outcome of 
interest on covariates. Although we consider continuous outcomes, it is straightforward to 
extend our approach to binary or ordinal outcomes via thresholding the latent continuous 
variables (see citations in Introduction). We consider the following linear regression model 


= aln + ^yl3j + e, ej~A(0,7y ^), Tr{a,r]) (x l/rj, 

Pi ~ , 7 r( 5 r) = ^^^( 1 + 5 r)"t, 7 ~ 7 r( 7 |G), f = l,...,n. (4) 

Here 7 = { 7 j,j = l,...,p} G F (the model space) is the vector of variable inclusion 
indicators, with 7 ^- = 1 if the jth candidate predictor is included in the model and 7 ^ = 0 
otherwise, is the P 7 x 1 vector of the regression coefficients with Ij being the 

size of model 7 , X 7 is the n x p^ covariate matrix (excluding an intercept) containing the 
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predictors in model 7 and having the i-th row as x^ j, and a,ri are the intercept and the 
residnal precision respectively. We address nncertainty in snbset selection throngh 7 r( 7 |G) 
depending on the estimated graph structnre on the covariates, while mixtnres of ^f-prior 
(Liang et ah, 2008) on characterizes the prior knowledge of the size of the coefficients for 
the selected predictors. The tnning parameter g is assigned the hyper-^f prior with a = 4. 

The prior on the model space 7 ~ 7 r( 7 |G) is dehned using clique indicators. Let 
Ci,...,Cg, denote the cliques identihed by the estimated graph G. The cliques are in¬ 
dicative of (potentially overlapping) groups of associated genetic features within and across 
platforms and gene locations. The variable selection approach espouses the philosophy that 
features in each clique are indicative of functional modules which work in coordination to 
drive the outcome. To this end, clique inclusion indicators are designed to include all fea¬ 
tures in a clique simultaneously if that clique is signihcantly associated with the outcome. 
Denote inclusion indicators for the k-th clique as yc^., k = 1,... ,q. We dehne the prior on 
the model space through clique inclusion probabilities as follows 

P( 7 Cfe = lIG) = tt, tt ~ Pe(a^, 6 ^), (5) 

where vr controls the sparsity of the model through the Beta hyperprior, with the resulting 
formulation achieving multiplicity control over the selection of cliques (Scott and Berger, 
2010). The above approach is designed to protect against collinearity by including a par¬ 
ticular feature if any one of the cliques to which it belongs is signihcant. This implies 
P( 7 j = 1) = P(U^g^^ {iCk = I}); where is the set of all cliques containing the j-th node. 
We call the resulting approach in (Q-(|^ Bayesian variable selection with structure learning 
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Figure 3: Directed acyclic graph of proposed model. The indices run : i ^ j = 1,... ,p,k = 1,... ,q, 
with q being number of cliques in the estimated graph. Solid boxes, circles, dashed boxes, represent 
observed data, model parameters, and model hyperparameters respectively. 

(BVS-SL), a schematic representation of which is presented in Figure 3. 

In practice, one can specify different inclusion probabilities tt*,, fc = l,...,g, in ([^ to 
account for the clique size, so as to encourage or discourage inclusion of large cliques. However 
our approach is designed to produce sparse graphical estimates, so that the clique sizes are 
expected be small and would not exhibit high variability, with such sparsity assumptions on 
the graph being pervasive in statistical literature. The simultaneous control of collinearity 
for features within cliques and multiplicity control over different cliques is designed to attain 
a desirable balance between detecting true positives and true negatives, a claim which is 
supported by our simulation studies. 

By computing the marginal inclusion probabilities of cliques, we are in a position to 
detect such subgroups which significantly affect outcome of interest. In addition, individual 
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significant covariates can also be detected by compnting marginal inclusion probabilities 
as described previously. In the scenario when all the cliques are disjoint with q < p, the 
formulation reduces to a clustering approach with all features within a certain cluster either 
simultaneously included or excluded, while allowing the within cluster hxed effects to be 
numerically different. In the limiting case, when each of these disjoint cliques has high 
positive pair-wise correlations tending to 1, our approach reduces to the traditional clustering 
approach with the within cluster fixed effects becoming numerically similar. In the special 
case when q = p, the above formulation reduces to the usual stochastic search variable 
selection (SSVS) approach. Finally, we demonstrate that the variable selection approach has 
appealing theoretical justifications under suitable assumptions as highlighted in Section 4. 

3. Posterior Computation 

The proposed approach contains two distinct sets of posterior computation, one for the 
graphical model estimation part, and another for the structured variable selection approach. 
The graphical model estimation for mixed covariates proceeds via sampling the latent un¬ 
derlying continuous variables corresponding to the ordered discrete covariates, followed by 
drawing the joint precision matrix of (x‘",z‘^) under formulation (|^. We adapt the proce¬ 
dure in Johnson and Albert (2001) to the case of dependent covariates, for posterior updates 
of the latent continuous covariates under the following posterior distributions 


= /, A-IJ, Dij ~ (-i)), ~ Unif(2;('^., 


0|„c „0/ 




.L U - 


where z^{—j) represents the vector of latent underlying variables for the i-th subject and 
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excluding the i-th measurement, zh = max and zV, = min 2 ;,^ for / = 1,M — 1. 

■ O 1 • O 7 I 1 

i:xy=l t:xy=l-\-l 

tj ij ' 

Once the latent variables have been updated at each MCMC iteration, we sample O, r using 
the method described in Wang (2012), while is updated by introducing an auxilary binary 
variable 5ij ~ Bernoulli(pjj), and using the posterior 


7r(Ajj|—) ~ Ga{l + a\, \u}ij\ + b\)l{5ij — 1) + Ga{l + kij + ax, |wp| + bx)l{6ij — 0), 


where l(-) is an indicator function and pij is drawn from a Beta posterior. The point 
estimate of the graph is obtained as a post-MCMC step by thresholding the estimated 
partial correlations. In particular, the (i,j)-th edge is included if and only if pij/E\x;{pij\X.) > 
0.5, where pij is the posterior partial correlation estimate under the continuous shrinkage 
approach, and E\y{pij\X.) represents the posterior mean of the partial correlation under the 
reference distribution W =Wishart(3, Jp). The posterior computation steps for structured 
variable selection conditional on G are presented in the Appendix A.3. 

4. THEORETICAL JUSTIFICATIONS 

In this section, we establish variable selection and prediction consistency results for the 
proposed structured variable selection approach. Model/variable selection consistency im¬ 
plies increasing posterior probability to the true model as the sample size increases, i.e. 
as n —)■ cx), where the true model in our case is given by 

; W = a-h + e, ~ A(0,77“^), Xj ~ A(0, fig ^), i = 1 ,..., n. ( 6 ) 

For prediction consistency, we would like the predicted outcome to be close to the true 
expected value with increasing certainty. Given the covariate vector x*, and the data (U"', X), 
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the predicted value as defined in Liang et. al (2008) is the optimal point estimator under 
the squared error loss obtained by Bayesian model averaging 


roo 

r = a+Y, X) / Y^, X)dg, 

do 9 

where a,/3y are the ordinary least squares estimates for model M^. We call the proposed 
approach consistent under prediction if P{y* = i?o( 2 /*)|X) —)■ 1 as n —>■ oo under the true 
sampling distribution (i.e. when a, is known), where Eo{y*) = a + is the 

mean in the true model. Denote Mjv as the null model. We have the following result under 
regulatory conditions (A1)-(A3), which are presented in Appendix A.l, with the proof in 
Appendix A.2. 

Theorem 1 Under assumptions (Al)-(A3) and for the class of models defined by model 
selection consistency and prediction consistency hold under the true model M^). 


5. SIMULATIONS 

The goal of our simulation study is to assess the variable selection and prediction per¬ 
formance for the proposed approach under several scenarios with varying dimensions and 
association structures for the covariates. We implement the proposed approach both with¬ 
out prior knowledge (BVS-SL) and with prior knowledge (prior corrected BVS-SL), where 
the prior graph is taken to be the true graph Gq used to generate the data. The same value of 
the belief parameter (50) was used for all edges corresponding to strong confidence; we also 
examine the effect of varying the belief parameter as explained in the sequel. We compare the 
proposed approach to stochastic search variable selection (SSVS) (George and McCullough, 
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1993) assuming independence of predictors, the penalized joint credible regions approach 
(PenCred) by Bondell and Reich (2012), which takes in to account dependence amongst 
predictors, as well as frequentist regularization approaches such as Lasso (Tibshirani et ah, 
2005) and elastic net (Zou and Hastie, 2005), using R packages ‘lars’ and ‘elasticnet’ re¬ 
spectively. To determine the cliques for an estimated graph G under our approach, we use 
the ‘igraph’ package in R. The number of MCMC iterations implemented for Bayesian ap¬ 
proaches was 10,000, with a burn in of 3000. The training and test sample sizes were 100 
each, and we consider p = 24,40, 80. All results are reported over 20 replicates. 

The data was generated from a linear regression model having p covariates out of which 
nine were ordinal (generated by thresholding the continuous latent variables) and taking 
values 0-4, and the rest were continuous. The true inclusion status is set to 7 ° = l,j = 
1,..., 8,23, 24, with four discrete variables included, and 7 ° = 0 otherwise. The contin¬ 
uous covariates and the continuous latent variables for discrete covariates were generated 
using a multivariate Gaussian distribution with covariance E't. We consider different block- 
diagonal structures for Sy (listed hereafter), specifying subgroups of predictors with varying 
partial correlations. The true graph Gq was obtained by including all edges {i,j) with 

> 0 . 0001 . 

Case 1(a): This case corresponds to high partial correlations with the precision matrix 
having four sub-blocks and all precision diagonals being 1. The hrst sub-block (4 x 4) has 
off-diagonals as 0.95, the second and third sub-blocks (4x4 each) have precision off-diagonals 
as 0.7, and the fourth sub-block {p — 12 x p — 12) is identity. The true coefficient vector was 
(0.3, -0.7,1.1, -0.05, 0.1,0.2, -1.2,1.5,0,..., 0,1, -1). 

Case 1(b): This case corresponds to high correlations with St having the same structure as 
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in Case 1(a). The coefficients were (0.3,0.7,1.1,0.05, —0.1, —0.2, —1.2, —1.5,0,..., 0,1, —1). 
Different signs for coefficients in cases I(a)-(b) ensures same signs for positively pair-wise 
correlated covariates. 

Case 1(c) : This case corresponds to a block diagonal with two sub-blocks - one having an 
AR(1) structure for the precision matrix with Sy^(i, j) = 0.95l*“-^l, f, j = 1, • • •, 8, and the other 
sub-block being identity. The coefficients were same as those in Case 1(a). 

Case 1(d): This case corresponds to having the same structure as in Case 1(c). The 
coefficients were same as those in Case 1(b). 

Ordering of Models: One can obtain an ordered sequence of regression models by varying 
the cut-off for the marginal inclusion probability under Bayesian approaches and varying 
the penalty parameter for frequentist approaches. For each point on the ordering, we can 
obtain specihcity = 1- false positive rate (FPR), and sensitivity which can be considered 
as the power to detect important predictors. To evaluate the ordering of the models, we 
look at receiver operating characteristic (ROC ) and precision recall characteristic (PRC) 
curves. For the set of ordered models, ROC curves plot sensitivity versus l-speci£city, while 
PRC curves instead plot the precision (ratio of true positives to the total number declared 
as positive) versus the recall or sensitivity. ROC curves present a picture of the trade-off 
between type I error and power, while PRC curves give a complementary picture, examining 
the trade-off between the power and false discovery rate (Davis and Goadrich, 2006). 

From the ROC and PRC curves presented in the Supplementary Material, it is clear 
that the curves for BVS-SL and the prior corrected BVS-SL, essentially always dominate the 
competing curves for all cases in higher dimensions. The area under ROC and PRC curves 
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presented in Tables 1-4 shows that the BVS-SL and prior corrected BVS-SL approaches 
essentially always perform better than the competitors for all dimensions, which indicates a 
superior model selection performance. From the Tables, BVS-SL demonstrates a uniformly 
higher power when the false discovery rate is controlled at 10 %, which points towards a 
superior performance in tackling collinearity for a given multiplicity level. 

Under the sample size and model dimensions we consider, the prior corrected BVS-SL 
may or may not perform better than it’s uncorrected counterpart. We expect that as n 
increases, the former will almost always outperform the latter for high values of the belief 
parameter, under no or minimal mis-specihcation of the prior knowledge. Sensitivity to prior 
knowledge is discussed in more detail in the sequel. 

Out of Sample Prediction: In addition to looking at the ordered sequence, it is certainly of 
interest to examine the predictive performance under each approach, as well as to assess the 
point estimate under the optimal model. The point estimate is selected using the Bayesian 
information criterion under BVS-SL, PenCred, Lasso, and elastic net, while the median 
probability rule is used for SSVS. We report the model size (MS) and false positives (FP) 
under the point estimate. This point estimate is also used for prediction under PenCred, 
Lasso, and elastic net, while the posterior predictive distribution is used under BVS-SL and 
SSVS. We look at the predictive performance in terms of out of sample mean squared error 
(RMSPE) and out of sample coverage of 95% predictive intervals (COV 95 ). The coverage 
refers to the proportion of test sample values contained within predictive intervals. The 
predictive intervals correspond to credible intervals for the Bayesian approaches BVS-SL 
and SSVS, whereas for PenCred, as well as the frequentist approaches, they correspond to 
pseudo conhdence intervals that are constructed as (x/3 — 1.96(To,x/3 -|- 1.96(To), where ao is 
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the true residual variance. 


It is seen from Tables 1-4 that while the proposed approach has comparable performance 
in terms of out of sample prediction and 95% coverage for lower dimensions, the relative 
performance with respect to competitors improves signihcantly as the number of noise pre¬ 
dictors is increased. The number of true covariates (MS - FP) detected under the proposed 
approach, as well as the coverage, is essentially always the best or the second best among 
all the approaches considered. We also see that while the SSVS might have an advantage 
compared to BVS-SL in terms of controlling false positives, the prior corrected BVS-SL 
essentially has similar or better multiplicity control compared to SSVS. 

An important observation is that the proposed approach seems to have a distinct advan¬ 
tage over the competitors for a large number of noise variables, a scenario often encountered 
in high-dimensional genomics applications. In particular, as the number of noise variables is 
increased: (i) the out of sample coverage under the proposed approach does not change sig¬ 
nihcantly, whereas the coverage for other competitors (except SSVS) decreases; and (ii) the 
competing approaches (except SSVS) demonstrate the well-known multiplicity problem, reg¬ 
istering an increasing number of false positives. On the other hand, the SSVS demonstrates 
drawbacks in higher dimensions in terms of collinearity, as evidenced by smaller model sizes, 
and poor power to detect true positives for a given level of false discovery. 

Sensitivity to Prior Knowledge: To evaluate the sensitivity of the prior corrected BVS- 
SL to prior graphical knowledge, we examined the performance when the prior knowledge 
deviates from the truth. In particular, we looked at the results when the available prior 
graph information has p, 2p, 3p mis-specihed edges, and compared it with the scenario when 
the prior graph was the truth. Our analysis indicate that (i) when the mis-specihcation error 
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The area under ROC Curve for Model I (b) 


The area under PRC Curve for Model I (b) 




Figure 4: Plots for area under receiver operating characteristic curve and precision recall curve, 
for varying belief parameters. 


rate is low, the prior corrected BVS-SL has superior performance compared to BVS-SL for 
high values of the belief parameter, and (ii) when the mis-specihcation rate is high, the prior 
corrected BVS-SL can perform poorly for high values of the belief, but performs similarly 
to unsupervised BVS-SL in uninformative settings involving low values of belief. This is 
consistent with our expectations in the sense that we can protect against mis-specihcation 
through low values of the belief parameter, while hoping to achieve signihcant improvements 
through high values of the belief when the available prior graph knowledge is true or close 
to the truth. In applications, we expect the belief parameter to be chosen by experts using 
subject knowledge, or alternatively one can look at a series of beliefs. Figure 4 shows the 
plot of area for case 1(b) under the ROC and PRC curves using our approach incorporating 
true prior knowledge, and a range of values of the belief parameter. It is clear that under 
a true prior graph, the area under the curve for prior corrected BVS-SL is a non-decreasing 
function of the belief parameter. 
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6. APPLICATION TO INTEGRATIVE GENOMICS 


6.1. Data Description 

Our motivating dataset arises from a TCGA-based study in glioblastoma multiforme (GBM), 
which is the most common and aggressive form of primary brain cancer in human adults 
(Holland, 2000). The TGGA data portal (http://tcga-data.nci.nih.gov/tcga/) provides multi¬ 
ple levels of molecular data for a large cohort of GBM tumor specimens. These specimens 
underwent rigorous quality control procedures via the TGGA Biospecimen Gore to ensure 
high-quality DNA and RNA extraction. Each qualihed specimen was assayed using multiple 
assays among which we concentrate on the following: messenger RNA (mRNA) expression 
using HT-HG-U133A (Affymetrix) arrays, DNA methylation (METH) using HumanMethy- 
lation27K (Illumina) and DNA copy number (GN) HG-GGH-244A (Agilent) arrays. All the 
resulting data from the three platforms are pre-processed, normalized and annotated to the 
gene level, and referred to as Level 3 data (see Wang et ah, 2013 for details). We focus our 
analysis on 48 genes that overlap with the three critical signaling pathways - RTK/PI3K, 
p53, and Rb, which are involved in migration, survival and apoptosis progression of cell 
cycles (Furnari et ah, 2007). Thus our covariate matrix consists of 48 genes mapped to these 
core pathways from D* = 3 platforms (mRNA, METH, GN) resulting in p = 48 x 3 = 144 
regressors. Note that mRNA and METH are continuous, while GN is converted into cate¬ 
gorical via thresholding, having three categories corresponding to loss, gain, or neutral. The 
outcome is log-transformed survival times for 163 uncensored patients which is regressed on 
the covariates using an accelerated failure time model. An integrative study is required for 
a better understanding of the interactions between these core pathways, and is expected to 
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provide new insights into the progression of GBM (Verhaak et ah, 2010). A description of 
our pre-processing of the copy number data is provided in the Supplementary Materials S.l. 

The prior knowledge on the graphical structure between these 48 genes is based on prior 
studies in GBM (Gerami et ah, 2012), and is denoted as Go,pr (shown in panel (b) of Figure 
1). This prior graph is obtained by assessing sequence mutations, copy number alterations 
and proteins and conhrm and extend the observation that GBM alterations tend to occur 
within specihc functional modules. Note that while the prior graph has only 48 nodes, the 
graph in our analysis comprises 144 nodes, across the 3 platforms. We construct the prior 
graph for these 144 nodes as one which is designed to preserve the prior graphical structure 
Go,pr within the platforms, but specify null interactions between two distinct genes across 
two different platforms, while assuming the presence of an edge between different platforms 
at the same gene location. Thus the prior graph can be concisely written as: Go = Go,pr < 8 .^ 3 , 
where ® represents the Kronecker product of the two matrices. 

6.2. Analysis Results 

Survival Association: The marginal inclusion probabilities under our analysis are pre¬ 
sented in Table 0.1 in Supplementary Material, with a corresponding plot in Figure 5. Since 
a large number of features have inclusion probabilities greater than 0.5, we select the im¬ 
portant features using a false discovery rate criteria controlled at a pre-specihed level a, as 
detailed in Appendix A.4. Under a false discovery rate threshold of a = 0.2 (corresponding 
to a posterior probability threshold of 0.7), seven genes are significantly associated with pro¬ 
gression through various mechanisms: through copy number, we have HRAS, TP53, GGNDl, 
BRAF and GDKN2G; through methylation, we have GRB2 and through mRNA, MDM2. 
Of these GDKN2G and GRB2 are positive drivers of progression, while the remaining genes 
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are negatively associated with progression. HRAS is a member of the RAS oncogene family, 
whose negative effect on Glioblastoma is previously observed on the overall and progression- 
free survival (Serao et ah, 2011). Cyclin D1 (CCNDl) belongs to the Cyclin D family of 
cell cycle regulators, which are known to be up-regulated and amplihed in malignant glioma 
(Buschges et ah, 1999). Similarly, MDM2 the inhibitor of the tumor suppressor TP53, is 
established to be a candidate gene associated with short progression (Lukashchuk and Vous- 
den, 2007). TP53 copy-number itself is associated with poor progression. The effect is 
through the deletion of TP53, which is known be associated poor progression in GBM (Yin 
et ah, 2009). Although, there is no evidence of BRAF amplihcation in GBM, a previous 
study established that BRAF amplihcation via gene duplication event activates the MAPK 
signaling in low-grade glioma (Phster et ah, 2008). Gyclin-dependent kinase 4 inhibitor G 
(CDKN2C) is a well characterized tumor suppressor gene associated with many cancers and 
known to be deleted in Glioblastoma (Solomon et ah, 2008; Dunn et ah, 2012). On the other 
hand. Growth factor receptor-bound protein 2 (GRB2) is a key protein in epidermal growth 
factor receptor (EGFR and EGFRVIII) signaling in the Glioblastoma tumoroginesis path¬ 
way (Huang et ah, 2009). The positive association of progression with DNA methylation of 
this gene is interesting, as it essentially inhibits the expression of this key regulator, GRB2, 
in GBM. 

Clique Analysis: The cliques represent groups of associated features within and across 
platforms which inhuence the outcome in a coordinated manner. The important cliques are 
detected as those having highly signihcant marginal inclusion probabilities P( 7 c^|—),j = 
1,..., g. The clique analysis depicted multiple interesting interactions, although all the signif¬ 
icant cliques comprise only two-way interactions. In certain cases, the multiple cliques formed 
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Posterior Probabilities 



Figure 5: Marginal inclusion probabilities for each gene over three platforms. The probabilities 
are presented for the three platforms grouped by genes, with blue, red and black, implying copy 
number, mRNA expression, and methylation, respectively. 

with the same molecular probe with different partners have highly signihcant marginal in¬ 
clusion probabilities. For instance AKTl (METH) clique interaction with many different 
molecular probes is significant (refer Table 0.2 of Supplementary Material). These cliques 
constitute both tumor suppressing as well as activating interactions. The cliques involving 
AKTl (METH), PTEN (mRNA) and AKTl (METH), PIK3R2 (mRNA) can be construed 
as tumor suppressing, while cliques involving AKTl (METH), CCNDl (mRNA) and AKTl 
(METH), GRB2 (CN) probably are tumor activating. The diverse biological functionality 
of the cliques represent the inherent biological subtypes with in the GBM (Verhaak et ah, 
2010 ). 

Neighborhood Analysis: In addition to detecting important markers for GBM, we also 
examine the estimated graph (panel (c) of Figure 1) within and across platforms. The 
degree of connectivity for nodes with marginal inclusion probabilities > 0.5 is shown in 
Table 0.1 in Supplementary Materials. To further explore the biological ramifications of 
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Figure 6: Neighborhood analysis of GRB2 (mRNA expression) using Ingenuity Pathway Analysis, 
showing the central biological role in this molecular network as a trigger of the RAS signaling upon 
the activation of upstream receptor tyrosine kinase family members. 

the connectivity, the resulting neighborhood of GRB2 (mRNA) was analyzed using the 
IPA (Ingenuity Systems, www.ingenuity.com) to identify the biological interactions based on 
literature (Figure 6). A strong effect of DNA methylation is observed in the neighborhood 
along with the mRNA, while the effect of copy number alteration is more subtle. GRB2 plays 
a central biological role in this molecular network as a trigger of the RAS signaling upon 
the activation of upstream receptor tyrosine kinase family members. The presence of three 
important tumor suppressor genes of GBM: RBI, GDKN2B and PIK3GG is interesting, 
although they have no direct interaction with GRB2. RBI and PIK3GG seem to lose their 
functionality through DNA methylation, while GDKN2B through copy number loss, enabling 
the RTK/RAS activation cascade via GRB2. These events reinforce the previous illustration 
in GBM that hypermethylation and deletion of RBI and GDKN2B respectively contribute 
to the loss of tumor suppressor function (Nakamura et al., 2001; Rao et al, 2010). 

The partial correlations of genes between the platforms was further explored to analyze 
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Figure 7: Left heatmap: hierarchical clustering of correlation between the mRNA and copy number 
data; right heatmap: hierarchical clustering of correlation between the mRNA and DNA methyla- 
tion data. Green and red pertains to positive and negative partial correlations respectively. 

their clustering patterns so as to illustrate globally coordinated changes across platforms. 
The clustered partial correlations between the genes across the platforms are visualized using 
the next-generation clustering heatmaps in Figure 7. From the Figure, it is clear that there 
is a enrichment of positive correlations between the mRNA and copy number data, and an 
enrichment of negative correlations between the mRNA and DNA methylation data, which 
further supports our biological-hypothesis driven integrative models. 

7. DISCUSSION 

We have proposed a novel two-step Bayesian structured variable selection approach, which 
is equipped to learn the graphical structure from mixed data in the presence of prior knowl¬ 
edge, and subsequently uses such structure learning to inform variable selection in a manner 
that controls for collinearity, while having a desirable control over multiplicity. In this 
paper, we focussed on integrating (more upstream) copy number, mRNA expression and 
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methylation markers associated with cancer progression; however our methods can be easily 
extended to account for downstream post- transcription and translational events such mi- 
croRNA and proteomics markers. This will provide vital clues towards understanding the 
complete genomic landscape of cancer development and progression. We leave this task for 
future consideration. 
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Appendix 

A.l. Regulatory Conditions for Theorem 1: Let Mtv be the null model and let Go 

now represent the true graph and not prior graphical knowledge as in the paper. Consider 
the following regulatory conditions: 

(Al) For any model My/ which does not contain the true model / Mat, lim„_^oo n/^ 7 o^ 7 o(-^“ 
^7')"^7o^7o ^ h'y! G ( 0 ,oo), where Ry' is the projection matrix on to the span of X^/. 

(A2) The true graph Go has disjoint cliques Ci,... and the true model (non-null) 
admits the representation 70 = Ufcg 5 (,{ 7 oj ; j G Ck], for some 5o C { 1 ,..., go}. 

(A3) The graphical model estimation approach is consistent, i.e. P{G = Go|X) — ^ 1 as n —> 00. 

The first assumption is a standard one in parametric consistency literature for linear 
models (Liang et ah, 2008). The second assumption assumes non-overlapping cliques in Go 
and that the true model includes or excludes predictors belonging to a particular clique si¬ 
multaneously. In the context of genetic studies, disjoint cliques could correspond to distinct 
clusters of genes or pathways. The third assumption regarding graphical model consistency 
is a strong but necessary condition. In solving the challenging problem of jointly estimating 
the graphical model and the regression model, there is little hope of acheiving model selection 
consistency in the second stage in the absence of a consistent graphical model approach in 
the hrst stage. We note similar assumptions involving knowledge of the true graph structure 
have been previously made in frequentist supervised clustering and feature selection litera¬ 
ture (Shen et al. 2012). 

A.2. Proof of Theorem 1: Let Q denote the hnite graph space. First note that 

P(M^jy",X) = J]P(M^jy-,X,G)P(G|X) 

G&g 

= P(M^jF-,X,Go)P(G = Go|X)+ ^ P(M^jy-, X, G)P(G|X). 

Gegn{Go}" 
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For a consistent graphical model estimation approach such that P{G = Go|X) —)■ 1, we 


only need to show that P(iVFy^|y"', X, Go) —>■ 1 as n —>■ oo for model selection consistency. 
Similarly, it is clear that the predicted value under assumption {A3) can be expressed as 



Let Cl,... ,Cqo be the disjoint cliques corresponding to the true graph Gq. Under our ap¬ 
proach, the posterior probabilities P{M^\Y '^« P(M-y|y”',X,Go) as n —)■ oo, will only be 
positive for models belonging to F* = {7 : 7 = Ufcg 5 *{ 7 oj : j G Ck,'iS* C go}}, so that 


the true model dehned in (j^ and assumption {A2) belongs to F*. The proof of the 


variable selection consistency follows by Theorem 3 of Liang et. al (2008) by using clique 
indicators 7 Cj , J = 1, • • •, Q'o in place of variable inclusion indicators in their article. The proof 
for prediction consistency follows using variable selection consistency and using Theorem 4 
of Liang et. al (2008). 

A.3. Posterior Computation Steps for Variable Selection 

The computation strategy described in section 3 yields an estimate of the graph, which is 
used to inform the variable selection approach in the second step as described here. At a 
particular MCMC iteration, let 7 ’'“(j) denote the vector of current variable inclusion indica¬ 
tors having 7 ^ = 1 for all fc G Gj, and similarly denote 7 “ (j) as the vector of current variable 
inclusion indicators having 7 ^ = 0 for all k E Cj. Further let denote the model sizes 

corresponding to respectively. The Gibbs sampling alternates as follows 

p+ 

Step 1: Sample Ja = 1, • • • ,P, where 

Pi +Pi 


with 



‘1 


2 
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with 2 -P"i(-) being the Gaussian hypergeometric function (computed using the ‘BAS’ package in R), 
is the ordinary coefficient of determination of regression model, and q is the number of cliques. 
Step 2: Sample fixed effects using 7r(/3^|—) = N ^^ 

Step 3: Sample the residual precision using 7r(77|—) = Ga{nl2 + — 1^^+(3^+Y'{Y'^ — 


I 


X^+/3-y+)/2 + bri). 

Step 4: Sample clique prior inclusion probabilities using /(7r|—) = Reta(g*+ a^,g —g*+ 6^) , where 
q* is the number of cliques selected using Step 1. 

Step 5: Sample g using the fact that a hyper-^i prior on g with a = 4 is equivalent to a uni¬ 
form prior on g' , where g' = For sampling g' , we use a discretized grid of 1000 grids point 

from 0.01 to 0.999 with the posterior probability for grid-point g* being computed as 7r{g'\—) oc 
_ ^ 

^ exp I~ 2 g ' | • We then use the transformation g = to obtain g. 


A. 4. FDR based Criteria for Inclusion: We briefly discuss the FDR based criteria 
used in section 6.2 to determine the cut-off on posterior probabilities, as described in Bal- 
adandayuthapani et al. (2010). In particular, we can choose a threshold 0^ for posterior 
probabilities so as to control the average Bayesian FDR at level a, which essentially implies 
that we expect 100 q;% of the significant markers to be false positives. To obtain such an 
estimate, first sort the posterior probabilities for all markers in ascending order to yield 
= ^, ■■■,?■ Then </)„ = where C = max ; j*-^ Ej=iFRc) - “}• 
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Table 1: Simulations for Case 1(a), training sample size = 100, test sample size = 100. 




P 

=^4 





Method 

MSPE 

ROC 

PRC 

Pwr(l0% EDR) 

MS 

EP 

C 0 U 95 

BVS-SL 

1.102 

0.998 

0.999 

1.000 

10.368 

0.474 

0.923 

Prior Corr BVS-SL (k = 20) 

1.101 

1.000 

1.000 

1.000 

10.250 

0.250 

0.923 

PenCred 

1.122 

0.915 

0.926 

0.833 

8.200 

0.450 

0.914 

ssvs 

1.107 

0.960 

0.957 

0.895 

8.200 

0.300 

0.924 

Lasso 

1.157 

0.892 

0.907 

0.815 

11.400 

2.700 

0.912 

EL 

1.165 

0.899 

0.911 

0.820 

11.600 

2.850 

0.910 



P 

=40 





BVS-SL 

1.103 

0.997 

0.996 

1.000 

10.000 

0.444 

0.922 

Prior Corr BVS-SL (k = 20) 

1.100 

1.000 

1.000 

1.000 

10.250 

0.250 

0.921 

PenCred 

1.153 

0.890 

0.869 

0.802 

9.400 

1.700 

0.912 

SSVS 

1.123 

0.954 

0.921 

0.880 

7.650 

0.200 

0.917 

Lasso 

1.206 

0.894 

0.870 

0.815 

10.950 

2.900 

0.893 

EL 

1.218 

0.906 

0.874 

0.815 

11.150 

3.050 

0.893 



P^ 

=80 





BVS-SL 

1.093 

0.999 

0.996 

mjD 

9.800 

0.250 

0.928 

Prior Corr BVS-SL (k = 20) 

1.092 

1.000 

1.000 

1.000 

10.600 

0.600 

0.930 

PenCred 

1.270 

0.888 

0.790 

0.740 

13.700 

6.150 

0.880 

SSVS 

1.128 

0.965 

0.897 

0.895 

8.050 

0.750 

0.926 

Lasso 

1.288 

0.879 

0.770 

0.800 

10.700 

3.100 

0.883 

EL 

1.300 

0.890 

0.773 

0.790 

11.050 

3.450 

0.879 


MSPE: out of sample predictive MSE; Pwr(10% EDR) is sensitivity controlling for 90% specificity; MS: 
estimated model size; EP: false positives, and Covg^ is coverage under 95% predictive intervals. 


Table 2; Simulations for Case 1(b), training sample size = 100, test sample size = 100. 




P= 

=24 





Vlethod 

MSPE 

ROC 

PRC 

Pwr(l0% EDR) 

MS 

EP 

C 0 V 95 

BVS-SL 

1.055 

0.962 

0.964 

0.950 

7.850 

0.150 

0.939 

Prior Corr BVS-SL (k = 20) 

1.055 

0.969 

0.976 

0.950 

8.450 

0.050 

0.940 

PenCred 

1.057 

0.770 

0.802 

0.565 

5.600 

0.250 

0.942 

SSVS 

1.052 

0.820 

0.842 

0.645 

6.000 

0.200 

0.941 

Lasso 

1.168 

0.618 

0.677 

0.450 

7.200 

2.300 

0.907 

EL 

1.175 

0.638 

0.693 

0.460 

7.300 

2.300 

0.905 



P = 

=40 





BVS-SL 

1.108 

0.971 

0.947 

[7955 

9.100 

1.000 

0.924 

Prior Corr BVS-SL (k = 20) 

1.111 

0.987 

0.981 

0.985 

9.550 

0.650 

0.923 

PenCred 

1.156 

0.749 

0.679 

0.550 

6.200 

1.200 

0.912 

SSVS 

1.107 

0.842 

0.785 

0.695 

6.400 

0.650 

0.925 

Lasso 

1.251 

0.630 

0.563 

0.423 

8.650 

4.050 

0.885 

EL 

1.265 

0.648 

0.598 

0.465 

8.300 

3.450 

0.881 



P = 

=80 





BVS-SL 

1.089 

O.960 

0.927 

[7950 

8.200 

0.500 

0.922 

Prior Corr BVS-SL (k = 20) 

1.082 

0.975 

0.963 

0.965 

8.600 

0.400 

0.927 

PenCred 

1.212 

0.738 

0.561 

0.518 

9.450 

4.450 

0.888 

SSVS 

1.093 

0.863 

0.730 

0.725 

6.050 

0.600 

0.924 

Lasso 

1.290 

0.609 

0.473 

0.450 

6.100 

2.200 

0.861 

EL 

1.295 

0.628 

0.536 

0.530 

7.250 

2.750 

0.861 


MSPE: out of sample predictive MSE; Pwr(10% EDR) is sensitivity controlling for 90% specificity; MS: 
estimated model size; EP: false positives, and C 0 U 95 is coverage under 95% predictive intervals. 


39 





Table 3: Simulations for Case 1(c), training sample size = 100, test sample size = 100. 




P 

=^4 





Method 

MSPE 

ROC 

PRC 

Pwr(l0% EDR) 

MS 

EP 

C0Vg5 

BVS-SL 

1.070 

0.958 

0.941 

0.905 

7.789 

0.421 

0.923 

Prior Corr BVS-SL (k = 20) 

1.068 

0.981 

0.978 

0.975 

7.800 

0.250 

0.925 

PenCred 

1.090 

0.856 

0.897 

0.750 

7.450 

0.400 

0.922 

ssvs 

1.080 

0.873 

0.899 

0.780 

7.200 

0.300 

0.923 

Lasso 

1.109 

0.852 

0.822 

0.760 

10.450 

2.450 

0.914 

EL 

1.101 

0.885 

0.903 

0.810 

10.400 

2.150 

0.915 



P 

=40 





BVS-SL 

1.071 

0.969 

0.929 

0.965 

7.941 

0.529 

0.936 

Prior Corr BVS-SL (k = 20) 

1.066 

0.986 

0.978 

0.980 

8.000 

0.100 

0.936 

PenCred 

1.118 

0.879 

0.860 

0.780 

9.150 

1.750 

0.921 

SSVS 

1.086 

0.880 

0.894 

0.825 

7.150 

0.250 

0.932 

Lasso 

1.142 

0.857 

0.787 

0.808 

11.550 

3.550 

0.918 

EL 

1.145 

0.895 

0.847 

0.820 

11.250 

2.950 

0.917 



P^ 

=80 





BVS-SL 

1.092 

0.951 

0.927 

0)935 

8.000 

0.316 

0.928 

Prior Corr BVS-SL (k = 20) 

1.084 

0.992 

0.988 

0.990 

7.650 

0.050 

0.928 

PenCred 

1.253 

0.867 

0.770 

0.735 

12.650 

5.750 

0.889 

SSVS 

1.109 

0.895 

0.796 

0.745 

6.600 

0.200 

0.921 

Lasso 

1.179 

0.814 

0.729 

0.695 

10.650 

3.800 

0.909 

EL 

1.189 

0.861 

0.770 

0.738 

9.900 

3.100 

0.910 


MSPE: out of sample predictive MSE; Pwr(10% EDR) is sensitivity controlling for 90% specificity; MS: 
estimated model size; EP: false positives, and Covg^ is coverage under 95% predictive intervals. 


Table 4: Simulations for Case 1(d), training sample size = 100, test sample size = 100. 




P = 

=24 





Vlethod 

MSPE 

ROC 

PRC 

Pwr(l0% EDR) 

MS 

EP 

C0Vg5 

BVS-SL 

1.068 

0.839 

0.846 

0)675 

6.650 

0.500 

0.930 

Prior Corr BVS-SL (k = 20) 

1.066 

0.823 

0.845 

0.705 

6.350 

0.450 

0.931 

PenCred 

1.086 

0.745 

0.770 

0.557 

5.800 

0.850 

0.923 

SSVS 

1.065 

0.810 

0.810 

0.560 

5.800 

0.500 

0.932 

Lasso 

1.159 

0.551 

0.606 

0.348 

6.450 

2.650 

0.906 

EL 

1.167 

0.576 

0.633 

0.353 

5.700 

1.950 

0.906 




=40 





BVS-SL 

1.072 

0.922 

0.851 

D ) M 5 

6.850 

0.850 

0.934 

Prior Corr BVS-SL (k = 20) 

1.065 

0.893 

0.857 

0.825 

6.250 

0.400 

0.935 

PenCred 

1.126 

0.691 

0.622 

0.480 

6.450 

1.700 

0.915 

SSVS 

1.071 

0.855 

0.767 

0.645 

5.500 

0.500 

0.933 

Lasso 

1.218 

0.556 

0.475 

0.340 

5.000 

1.750 

0.891 

EL 

1.225 

0.594 

0.516 

0.390 

5.100 

1.550 

0.893 




=80 





BVS-SL 

1.109 

0.937 

0.788 

0 ) R 65 

5.900 

0.900 

0.919 

Prior Corr BVS-SL (k = 20) 

1.105 

0.894 

0.798 

0.815 

5.800 

0.400 

0.918 

PenCred 

1.226 

0.633 

0.419 

0.358 

7.850 

4.150 

0.897 

SSVS 

1.124 

0.840 

0.655 

0.628 

4.600 

0.400 

0.914 

Lasso 

1.239 

0.509 

0.354 

0.320 

4.650 

1.700 

0.886 

EL 

1.248 

0.596 

0.406 

0.355 

4.800 

1.700 

0.881 


MSPE: out of sample predictive MSE; Pwr(10% EDR) is sensitivity controlling for 90% specificity; MS: 
estimated model size; EP: false positives; Covgg: coverage under 95% predictive intervals. 
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